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Abstract 

This paper presents a comparative evalua- 
tion among the systems that participated 
in the Spanish and English lexical sam- 
ple tasks of Senseval-2. The focus is 
on pairwise comparisons among systems 
to assess the degree to which they agree, 
and on measuring the difficulty of the test 
instances included in these tasks. 

1 Introduction 

This paper presents a post-mortem analysis of 
the English and Spanish lexical sample tasks of 
Senseval-2. Two closely related questions are 
considered. First, to what extent did the compet- 
ing systems agree? Did systems tend to be redun- 
dant and have success with many of the same test in- 
stances, or were they complementary and able to dis- 
ambiguate different portions of the instance space? 
Second, how much did the difficulty of the test in- 
stances vary? Are there test instances that proved 
unusually difficult to disambiguate relative to other 
instances? 

We address the first question via a series of pair- 
wise comparisons among the participating systems 
that measures their agreement via the kappa statis- 
tic. We also introduce a simple measure of the de- 
gree to which systems are complementary called op- 
timal combination. We analyze the second question 
by rating the difficulty of test instances relative to the 
number of systems that were able to disambiguate 
them correctly. 



Nearly all systems that received official scores 
in the Spanish and English lexical sample tasks of 
Senseval-2 are included in this study. There are 
23 systems included from the English lexical sam- 
ple task and eight from the Spanish. Table [T] lists the 
systems and shows the number of test instances that 
each disambiguated correctly, both by part of speech 
and in total. 

2 Pairwise System Agreement 

Assessing agreement among systems sheds light on 
whether their combined performance is potentially 
more accurate than that of any of the individual sys- 
tems. If several systems are largely in agreement, 
then there is little benefit in combining them since 
they are redundant and will simply reinforce one an- 
other. However, if some systems disambiguate in- 
stances that others do not, then they are complemen- 
tary and it may be possible to combine them to take 
advantage of the different strengths of each system 
to improve overall accuracy. 

The kappa statistic dCohen, 1960 ) is a measure of 
agreement between multiple systems (or judges) that 
is scaled by the agreement that would be expected 
just by chance. A value of 1.00 suggests complete 
agreement, while 0.00 indicates pure chance agree- 
ment. Negative values indicate agreement less than 
what would be expected by chance. (Krippendorf, 
1980) points out that it is difficult to specify a par- 
ticular value of kappa as being generally indicative 
of agreement. As such we simply use kappa as a 
tool for comparison and relative ranking. A detailed 
discussion on the use of kappa in natural language 



processing is presented in (Carletta, 1996). 



Table 1: Lexical Sample Systems 



system 




correct instances 


name 


noun 


verb 


adj 


total (%) 


English 




1 / j4 


1 QA£ 


/Do 


/I^JOQ (\ C\C\\ 


jhu_final 


1 1 OA 

i iyo 


lUzz 


JOZ 


Z/oU (U.04J 


smuls 


iziy 


1 A1 A 

lUlo 


CTO 

JZO 


OTA 1 } /A 

z/o3 (U.o4; 


kunlp 


11/1 


1 A/1 A 

1U4U 


jl J 


Z/Z4 


cszz4n 


i lyo 


O/l < 


JZ / 


oata /a ao\ 
ZD /U (U.OZ) 


lia 


1 1 T7 
11// 


966 


C 1 A 
MO 


0£C2 /A ^1 \ 


talp 


1 1 /IO 

i i4y 


yz/ 


4yj 


OCT! /A <0\ 

z3 / 1 (U. jy; 


aulutnj 


113/ 


O/IA 

o4U 


4y / 


Z4/3 (U.J / ; 


umcp 


1 AO 1 

lUol 


OO 1 

syi 


48 / 


0/1 <0 /A <0"\ 

z4jV {yj.j /) 


ehu_all 


1 AAO 

1069 


OO 1 

o91 


A OA 

4SU 


z44U (U.jd) 


ULULUIrH- 


1UUJ 


oUO 




ZJHD \\J.J L r) 


duluth2 


1056 


795 


483 


2334 (0.54) 


lesk_corp 


960 


804 


454 


2218 (0.51) 


duluthB 


1004 


729 


467 


2200(0.51) 


uned_ls_t 


987 


699 


469 


2155 (0.50) 


common 


880 


728 


453 


2061 (0.48) 


alicante 


427 


866 


486 


1779 (0.41) 


uned_ls_u 


781 


519 


437 


1736 (0.40) 


clr_ls 


602 


393 


272 


1267 (0.29) 


iit2 


541 


348 


166 


1054 (0.24) 


iitl 


516 


337 


182 


1034 (0.24) 


lesk 


467 


328 


182 


977 (0.23) 


lesk_def 


438 


159 


108 


704(0.16) 


random 


303 


153 


155 


611 (0.14) 


Spanish 




799 


745 


681 


2225 (1.00) 


jhu 


560 


478 


546 


1584 (0.71) 


cs224 


520 


443 


526 


1489 (0.67) 


umcp 


482 


435 


479 


1396 (0.63) 


duluth8 


494 


382 


494 


1369 (0.62) 


duluth7 


470 


374 


480 


1324 (0.60) 


duluth9 


445 


359 


446 


1250 (0.56) 


duluthY 


411 


325 


434 


1170 (0.53) 


alicante 


269 


381 


468 


1118 (0.50) 



To study agreement we have made a series of pair- 
wise comparisons among the systems included in the 
English and Spanish lexical sample tasks. Each pair- 
wise combination is represented in a 2 x 2 contin- 
gency table, where one cell represents the number of 
test instances that both systems disambiguate cor- 
rectly, one cell represents the number of instances 
where both systems are incorrect, and there are two 
cells to represent the counts when only one system is 
correct. Agreement does not imply accuracy, since 
two systems may get a large number of the same in- 
stances incorrect and have a high rate of agreement. 

Tables || and || show the system pairs in the 
English and Spanish lexical sample tasks that ex- 
hibit the highest level of agreement according to the 
kappa statistic. The values in the both-one-zero col- 
umn indicate the percentage of instances where both 
systems are correct, where only one is correct, and 
where neither is correct. The top 15 pairs are shown 
for nouns and verbs, and the top 10 for adjectives. 
A complete list would include about 250 pairs for 
each part of speech for English and 24 such pairs for 
Spanish. 

The utility of kappa agreement is confirmed in 
that system pairs known to be very similar have cor- 
respondingly high measures. In Table |2], duluth2 
and duluth3 exhibit a high kappa value for all parts 
of speech. This is expected since duluth3 is an en- 
semble approach that includes duluth2 as one of its 
members. The same relationship exists between du- 
luth7 and duluth8 in the Spanish lexical sample, and 
comparable behavior is seen in Table H 

A more surprising case is the even higher level of 
agreement between the most common sense base- 
line and the lesk corpus baseline shown in Table 2. 
This is not necessarily expected, and suggests that 
lesk corpus may not be finding a significant number 
of matches between the Senseval contexts and the 
WordNet glosses (as the lesk algorithm would hope 
to do) but instead may be relying on a simple default 
in many cases. 

In previous work ( pedersen, 2001 ) we propose a 
50-25-25 rule that suggests that about half of the in- 
stances in a supervised word sense disambiguation 
evaluation will be fairly easy for most systems to 
resolve, another quarter will be harder but possible 
for at least some systems, and that the final quar- 
ter will be very difficult for any system to resolve. 



Table 2: Pairwise Agreement English Table 3: Pairwise Agreement Spanish 



system pair 


both-one-zero 


kappa 


system pair 


both-one-zero 


kappa 


Nouns 






Nouns 




common lesk_corp 


0.49 0.06 0.44 


0.87 


duluth7 duluth8 


0.57 0.11 0.32 


0.76 


duluth2 duluth3 


0.60 0.08 0.32 


0.82 


umcp duluth9 


0.50 0.17 0.33 


0.65 


lesk_corp umcp 


0.53 0.11 0.36 


0.78 


duluth7 duluthY 


0.49 0.21 0.30 


0.57 


duluth2 duluthB 


0.54 0.14 0.32 


0.70 


umcp duluthY 


0.48 0.21 0.31 


0.56 


iitl iit2 


0.24 0.13 0.63 


0.69 


duluth8 duluthY 


0.50 0.21 0.29 


0.56 


duluth3 duluthB 


0.56 0.15 0.29 


0.68 


umcp duluth8 


0.50 0.23 0.27 


0.51 


common umcp 


0.48 0.16 0.36 


0.68 


umcp duluth7 


0.50 0.23 0.27 


0.51 


ehu_all umcp 


0.55 0.17 0.29 


0.64 


cs224 umcp 


0.51 0.24 0.25 


0.49 


uned_ls_t uned_ls_u 


0.41 0.19 0.40 


0.63 


duluth9 duluthY 


0.44 0.26 0.30 


0.47 


duluth3 duluth4 


0.55 0.18 0.27 


0.61 


duluth8 duluth9 


0.47 0.27 0.27 


0.45 


duluth2 duluth4 


0.52 0.19 0.29 


0.60 


cs224 duluth9 


0.47 0.27 0.26 


0.44 


duluth4 duluthB 


0.51 0.19 0.29 


0.59 


cs224 jhu 


0.55 0.25 0.20 


0.43 


ehu_all lesk_corp 


0.50 0.20 0.30 


0.59 


cs224 duluth8 


0.51 0.27 0.22 


0.42 


cs224n duluth3 


0.58 0.19 0.24 


0.58 


jhu umcp 


0.51 0.29 0.21 


0.38 


cs224n duluth4 


0.55 0.19 0.25 


0.58 


jhu duluth8 


0.53 0.28 0.19 


0.37 




Verbs 






Verbs 




common lesk_corp 


0.39 0.06 0.55 


0.88 


duluth7 duluth8 


0.48 0.08 0.44 


0.84 


duluth2 duluth3 


0.43 0.07 0.50 


0.85 


duluth8 duluth9 


0.44 0.14 0.42 


0.72 


duluth3 duluth4 


0.39 0.14 0.47 


0.72 


umcp duluth8 


0.48 0.14 0.37 


0.71 


duluth2 duluth4 


0.38 0.15 0.47 


0.69 


umcp duluth9 


0.46 0.16 0.38 


0.69 


lesk_corp umcp 


0.38 0.17 0.44 


0.65 


duluth7 duluth9 


0.42 0.16 0.42 


0.68 


common umcp 


0.36 0.18 0.46 


0.65 


umcp duluth7 


0.47 0.16 0.37 


0.68 


cs224n duluth3 


0.40 0.20 0.40 


0.60 


duluth8 duluthY 


0.44 0.16 0.39 


0.67 


cs224n duluth4 


0.38 0.21 0.41 


0.59 


duluth9 duluthY 


0.42 0.18 0.40 


0.65 


cs224n duluth2 


0.38 0.21 0.40 


0.57 


duluth7 duluthY 


0.43 0.18 0.39 


0.64 


uned_ls_t uned_ls_u 


0.24 0.20 0.55 


0.56 


umcp duluthY 


0.46 0.19 0.35 


0.61 


duluth3 lia 


0.39 0.23 0.38 


0.54 


cs224 umcp 


0.49 0.19 0.32 


0.61 


lesk_corp talp 


0.36 0.23 0.40 


0.54 


cs224 duluth8 


0.44 0.25 0.32 


0.50 


cs224n lia 


0.41 0.24 0.35 


0.53 


alicante umcp 


0.42 0.25 0.33 


0.50 


common talp 


0.34 0.24 0.42 


0.52 


cs224 duluth7 


0.43 0.26 0.32 


0.48 


kunlp talp 


0.42 0.24 0.33 


0.51 


cs224 jhu 


0.49 0.25 0.26 


0.48 


Adjectives 




Adjectives 




common lesk_corp 


0.59 0.00 0.41 


0.99 


duluth7 duluth8 


0.69 0.06 0.25 


0.85 


duluth2 duluth3 


0.63 0.03 0.34 


0.93 


duluth7 duluthY 


0.60 0.14 0.26 


0.68 


lesk_corp umcp 


0.58 0.07 0.35 


0.86 


umcp duluthY 


0.60 0.15 0.26 


0.67 


duluth2 duluthB 


0.59 0.07 0.35 


0.86 


umcp duluth9 


0.61 0.14 0.25 


0.67 


duluth3 duluthB 


0.60 0.07 0.34 


0.86 


duluth8 duluthY 


0.61 0.15 0.24 


0.67 


common umcp 


0.58 0.07 0.35 


0.86 


umcp duluth8 


0.64 0.15 0.21 


0.64 


duluth4 duluthB 


0.55 0.14 0.32 


0.71 


duluth9 duluthY 


0.56 0.18 0.26 


0.60 


duluth3 duluth4 


0.57 0.14 0.30 


0.71 


duluth8 duluth9 


0.61 0.17 0.22 


0.59 


cs224n duluth3 


0.60 0.13 0.27 


0.70 


umcp duluth7 


0.62 0.17 0.21 


0.59 


cs224n duluth2 


0.59 0.14 0.27 


0.70 


duluth7 duluth9 


0.58 0.20 0.22 


0.54 



This same idea could also be expressed by stating 
that the kappa agreement between two word sense 
disambiguation systems will likely be around 0.50. 
In fact this is a common result in the full set of pair- 
wise comparisons, particularly for overall results not 
broken down by part of speech. Tables 2 and 3 only 
list the largest kappa values, but even there kappa 
quickly reduces towards 0.50. These same tables 
show that it is rare for two systems to agree on more 
than 60% of the correctly disambiguated instances. 

3 Optimal Combination 

An optimal combination is the accuracy that could 
be attained by a hypothetical tool called an optimal 
combiner that accepts as input the sense assignments 
for a test instance as generated by several different 
systems. It is able to select the correct sense from 
these inputs, and will only be wrong when none of 
the sense assignments is the correct one. Thus, the 
percentage accuracy of an optimal combiner is equal 
to one minus the percentage of instances that no sys- 
tem can resolve correctly. 

Of course this is only a tool for thought experi- 
ments and is not a practical algorithm. An optimal 
combiner can establish an upper bound on the accu- 
racy that could reasonably be attained over a partic- 
ular sample of test instances. 

Tables ^| and |5| list the top system pairs ranked 
by optimal combination (1.00 - value in zero col- 
umn) for the English and Spanish lexical samples. 
Kappa scores are also shown to illustrate the inter- 
action between agreement and optimal combination. 
Optimal combination is maximized when the per- 
centage of instances where both systems are wrong 
is minimized. Kappa agreement is maximized by 
minimizing the percentage of instances where one 
or the other system (but not both) is correct. Thus, 
the only way a system pair could have a high mea- 
sure of kappa and a high measure of optimal com- 
bination is if they were very accurate systems that 
disambiguated many of the same test instances cor- 
rectly. 

System pairs with low measures of agreement 
are potentially quite interesting because they are the 
most likely to make complementary errors. For ex- 
ample, in Table [5] under nouns, the alicante system 
has a low level of agreement with all of the other 



Table 4: Optimal Combination English 



system pair 


both-one-zero 


kappa 




Nouns 




kunlp smuls 


0.49 0.39 0.12 


0.11 


smuls talp 


0.48 0.39 0.13 


0.11 


cs224n kunlp 


0.48 0.39 0.13 


0.11 


ehu_all smuls 


0.47 0.40 0.13 


0.10 


cs224n talp 


0.48 0.39 0.14 


0.13 


jhu_final kunlp 


0.49 0.37 0.14 


0.16 


smuls umcp 


0.45 0.41 0.14 


0.10 


kunlp lia 


0.48 0.38 0.14 


0.14 


lia talp 


0.47 0.39 0.14 


0.13 


jhu_nnal talp 


0.48 0.38 0.14 


0.15 


duluth3 kunlp 


0.47 0.38 0.15 


0.15 


cs224n ehu_all 


0.48 0.38 0.15 


0.16 


ehu_all lia 


0.47 0.38 0.15 


0.15 


ehu_all jhu_frnal 


0.48 0.36 0.16 


0.19 


duluth3 talp 


0.47 0.38 0.16 


0.17 




Verbs 




jhu_final kunlp 


0.34 0.46 0.20 


0.06 


ehu_all jhu_frnal 


0.31 0.44 0.21 


0.07 


ehu_all smuls 


0.31 0.44 0.21 


0.07 


ehu_all kunlp 


0.33 0.41 0.22 


0.13 


kunlp smuls 


0.36 0.43 0.22 


0.13 


cs224n ehu_all 


0.29 0.45 0.22 


0.05 


ehu_all lia 


0.30 0.44 0.22 


0.08 


cs224n kunlp 


0.33 0.44 0.23 


0.11 


alicante ehu_all 


0.26 0.47 0.23 


0.03 


kunlp lia 


0.34 0.42 0.23 


0.14 


jhu_final talp 


0.32 0.44 0.24 


0.12 


duluth3 ehu_all 


0.26 0.46 0.24 


0.05 


ehu_all talp 


0.30 0.41 0.24 


0.13 


alicante jhu_final 


0.30 0.45 0.24 


0.09 


jhu_final umcp 


0.30 0.45 0.24 


0.09 


Adjectives 




alicante jhu_final 


0.46 0.37 0.08 


0.03 


alicante smuls 


0.41 0.41 0.09 


-0.04 


alicante cs224n 


0.42 0.40 0.09 


-0.01 


alicante kunlp 


0.41 0.39 0.11 


0.03 


alicante lia 


0.41 0.39 0.11 


0.03 


alicante duluth3 


0.40 0.40 0.11 


0.02 


alicante talp 


0.40 0.41 0.11 


0.02 


alicante ehu_all 


0.41 0.39 0.11 


0.05 


alicante umcp 


0.39 0.40 0.12 


0.04 


alicante duluth2 


0.39 0.40 0.12 


0.03 



Table 5: Optimal Combination Spanish 



system pair 


both-one-zero 


kappa 




Nouns 




alicante jhu 


0.29 0.32 0.11 


0.06 


alicante duluth7 


0.27 0.34 0.12 


0.03 


alicante duluthY 


0.25 0.35 0.12 


0.01 


alicante duluth8 


0.28 0.32 0.13 


0.08 


alicante cs224 


0.28 0.32 0.13 


0.09 


alicante umcp 


0.26 0.33 0.14 


0.06 


alicante duluth9 


0.26 0.31 0.16 


0.14 


jhu duluthY 


0.46 0.36 0.18 


0.24 


jhu duluth7 


0.51 0.29 0.19 


0.35 


jhu duluth8 


0.53 0.28 0.19 


0.37 


cs224 jhu 


0.55 0.25 0.20 


0.43 


jhu duluth9 


0.46 0.34 0.20 


0.29 


jhu umcp 


0.51 0.29 0.21 


0.38 


cs224 duluth7 


0.49 0.30 0.22 


0.36 


cs224 duluth8 


0.51 0.27 0.22 


0.42 




Verbs 




jhu duluthY 


0.39 0.38 0.23 


0.23 


jhu umcp 


0.48 0.27 0.25 


0.44 


jhu duluth9 


0.39 0.36 0.26 


0.29 


jhu duluth8 


0.42 0.32 0.26 


0.35 


cs224 jhu 


0.49 0.25 0.26 


0.48 


jhu duluth7 


0.42 0.32 0.26 


0.36 


alicante jhu 


0.45 0.26 0.29 


0.47 


cs224 duluthY 


0.43 0.27 0.30 


0.46 


alicante cs224 


0.41 0.28 0.31 


0.44 


alicante duluthY 


0.35 0.34 0.31 


0.32 


cs224 umcp 


0.49 0.19 0.32 


0.61 


cs224 duluth7 


0.43 0.26 0.32 


0.48 


cs224 duluth8 


0.44 0.25 0.32 


0.50 


cs224 duluth9 


0.41 0.27 0.32 


0.47 


alicante umcp 


0.42 0.25 0.33 


0.50 


Adjectives 




jhu duluth8 


0.66 0.22 0.12 


0.39 


jhu duluth7 


0.64 0.24 0.12 


0.36 


jhu duluthY 


0.56 0.31 0.12 


0.25 


alicante jhu 


0.62 0.26 0.13 


0.33 


jhu duluth9 


0.59 0.29 0.13 


0.29 


cs224 jhu 


0.70 0.16 0.13 


0.51 


jhu umcp 


0.64 0.23 0.13 


0.38 


alicante cs224 


0.61 0.24 0.15 


0.39 


cs224 duluth8 


0.66 0.19 0.16 


0.50 


cs224 duluth7 


0.64 0.20 0.16 


0.49 



Table 6: Difficulty of Instances 



# 


noun 


verb 


adj 


total 






English 









59 (16) 


174 (6) 


29 (8) 


262 (8) 


1 


51 (15) 


116(10) 


26 (14) 


193 (12) 


2 


59 (18) 


122 (12) 


41 (21) 


222 (15) 


3 


64 (19) 


117 (16) 


29 (23) 


210(18) 


4 


84 (17) 


102 (16) 


28 (18) 


214 (17) 


5 


76 (23) 


76(18) 


24 (20) 


176 (21) 


6 


53 (28) 


61 (30) 


23 (31) 


137 (29) 


7 


51 (29) 


65 (22) 


23 (34) 


139 (27) 


8 


62 (27) 


58 (34) 


18 (31) 


138 (30) 


9 


47 (32) 


69 (28) 


17 (26) 


133 (29) 


10 


62 (28) 


61 (32) 


18 (30) 


141 (30) 


11 


55 (39) 


56 (26) 


21 (38) 


132 (34) 


12 


80 (40) 


61 (41) 


22 (35) 


163 (40) 


13 


86 (58) 


56 (34) 


21 (45) 


163 (48) 


14 


125 (65) 


62 (49) 


33 (51) 


220 (59) 


15 


131 (77) 


125 (99) 


36 (60) 


292 (84) 


16 


141 (83) 


107 (117) 


61 (70) 


309 (92) 


17 


133 (75) 


100 (162) 


86 (74) 


319(101) 


18 


92 (73) 


80 (203) 


102 (80) 


274 (113) 


19 


97 (68) 


59 (170) 


49 (77) 


205 (100) 


20 


65 (66) 


38 (192) 


30 (49) 


133 (96) 


21 


42 (68) 


15 (155) 


17 (47) 


74 (79) 


22 


29 (70) 


15 (73) 


7(39) 


51 (67) 


23 


10 (49) 


11 (52) 


7(38) 


28 (47) 






Spanish 









50 (16) 


126 (12) 


52 (24) 


228 (16) 


1 


81 (18) 


63 (17) 


32 (36) 


176 (21) 


2 


63 (24) 


69 (18) 


42 (50) 


174 (28) 


3 


63 (27) 


55 (23) 


39 (81) 


157 (39) 


4 


74 (32) 


47 (23) 


43 (101) 


164 (47) 


5 


94 (35) 


49 (28) 


35 (77) 


178 (42) 


6 


87 (40) 


61 (39) 


57 (90) 


205 (53) 


7 


182 (47) 


94 (46) 


88 (93) 


364 (58) 


8 


105 (44) 


181 (62) 293 (166) 


579(111) 



Table 7: Difficulty of English Word Types 



word-pos (test) 


mean 


word-pos (test) 


mean 


collaborate-v (30) 


OA 1 

zU.z 


circuit-n (85) 


1 A H 

1U. / 


solemn-a (25) 


lo.3 


sense-n (53) 


1 A A 

lU.o 


holiday-n (31) 


Ann 

1 1 . 1 


authority-n (92) 


1 A < 

1U.J 


ayKe-n (Zo j 


1 /.j 


replace-v (45) 


1 A A 

1U.4 


graceful-a (29) 


1 /.3 


restraint-n (45) 


1 A 1 

1U.3 


vital-a (Jo) 


lb. / 


live-v (o /) 


1 A O 
1U.Z 


detention-n (32) 


lO.J 


treat-v (44) 


1 A 1 
1U. 1 


iaitniui-a (z3 ) 


lO.J 


iree-a (oz) 


1 A A 

1U.U 


yew-n (28) 


1 1 

10. 1 


nature-n (46) 


1U.U 


chair-n (69) 


lo.U 


simple-a (66) 


y.o 


ferret- v (1) 


1 A A 

lo.U 


dress-v (59) 


y. / 


Dlma-a (j j ) 


1 < H 
Ij. / 


cool-a (52) 


y. / 


laay-n (jj) 


Ij.J 


oar-n ( l j l ) 


y.j 


spade-n (33) 


1 J. J 


stress-n (39) 


y.j 


neartn-n (3z) 


15.1 


channel-n (73) 


y.z 


lace-v (yj) 


1 j. 1 


matcn-v (4Zj 


y.u 


green-a (94) 


14. y 


natural- a (103) 


O A 

y.u 


fatigue-n (43) 


1 A O 

14. y 


serve-v (51) 


O 
0.0 


oblique-a (29) 


14. J 


train-v (63) 


0. / 


nation-n (37) 


1 a n 

14. U 


post-n y /y) 


8 1 
O. / 


church-n (64) 


11 Q 

1 J.o 


tine-a (/U) 


0.0 


iocai-a (Jo j 


1 J.o 


J^ft , 7 /11\ 

arut-v ( jz j 


1 . 1 


tit-a (zy) 


11/1 
13.4 


leave-v (66) 


1 . 1 


use-v (76) 


1 J.4 


play-v (66) 


1 . J 


cnila-n (04) 


1 1 A 

13. U 


wash-v (12) 


n a 
1 A 


wander-v (50) 


iz.y 


keep-v (67) 


n a 
1 A 


begin-v (ZoU) 


lz.o 


work-v (60) 


H A 
/.U 


bum-n (45) 


12.5 


drive-v (42) 


6.8 


feeling-n (51) 


11.4 


develop-v (69) 


6.6 


facility-n (58) 


11.1 


carry-v (66) 


6.3 


colorless (35) 


11.1 


see-v (69) 


6.3 


grip-n (51) 


11.1 


strike-v (54) 


5.9 


day-n (145) 


11.0 


call-v (66) 


5.8 


mouth-n (60) 


11.0 


pull-v (60) 


5.7 


material-n (69) 


11.0 


turn-v (67) 


5.0 


art-n (98) 


10.7 


draw-v (41) 


4.7 






find-v (68) 


4.2 



Table 8: Difficulty of Spanish Word Types 



word-pos (test) 


mean 


word-pos (test) 


mean 


claro-a (66) 


7.6 


verde-a (33) 


5.3 


local-a (55) 


7.4 


canal-n (41) 


5.3 


popular-a (204) 


7.1 


clavar-v (44) 


5.1 


partido-n (57) 


7.0 


masa-n (41) 


5.1 


bomba-n (37) 


6.8 


apuntar-v (49) 


4.9 


brillante-a (87) 


6.7 


autoridad-n (34) 


4.9 


usar-v (56) 


6.5 


tocar-v (74) 


4.8 


tabla-n (41) 


6.3 


explotar-v (41) 


4.7 


vencer-v (65) 


6.3 


programa-n (47) 


4.7 


simple-a (57) 


6.2 


circuito-n (49) 


4.3 


hermano-n (57) 


6.1 


copiar-v (53) 


4.3 


apoyar-v (73) 


6.0 


actuar-v (55) 


4.2 


vital-a (79) 


5.9 


operacion-n (47) 


4.2 


gracia-n (61) 


5.9 


pasaje-n (41) 


4.1 


organo-n (81) 


5.8 


saltar-v (37) 


4.1 


corona-n (40) 


5.5 


tratar-v (70) 


3.9 


ciego-a (42) 


5.5 


natural-a (58) 


3.9 


corazon-n (47) 


5.5 


grano-n (22) 


3.9 


coronar-v (74) 


5.4 


conducir-v (54) 


3.8 


naturaleza-n (56) 


5.4 







systems. However, the measure of optimal combi- 
nation is quite high, reaching 0.89 (1.00 - 0.11) for 
the pair of alicante and jhu. In fact, all seven of the 
other systems achieve their highest optimal combi- 
nation value when paired with alicante. 

This combination of circumstances suggests that 
the alicante system is fundamentally different than 
the other systems, and is able to disambiguate a cer- 
tain set of instances where the other systems fail. In 
fact the alicante system is different in that it is the 
only Spanish lexical sample system that makes use 
of the structure of Euro-WordNet, the source of the 
sense inventory. 

4 Instance Difficulty 

The difficulty of disambiguating word senses can 
vary considerably. A word with multiple closely re- 
lated senses is likely to be more difficult than one 
with a few starkly drawn differences. In supervised 
learning, a particular sense of a word can be diffi- 
cult to disambiguate if there are a small number of 
training examples available. 

Table ^ shows the distribution of the number of 



instances that are successfully disambiguated by a 
particular number of systems in both the English 
and Spanish lexical samples. The value under the 
# column shows the number of systems that are able 
to disambiguate the number of noun, verb, adjec- 
tive and total instances shown in the row. The aver- 
age number of training examples available for the 
correct answers associated with these instances is 
shown in parenthesis. For example, the first line 
shows that there were 59 noun instances that no sys- 
tem (of 23) could disambiguate, and that there were 
on average 16 training examples available for each 
of the correct senses for these 59 instances. 

Two very clear trends emerge. First, there are a 
substantial number of instances that are not disam- 
biguated correctly by any system (262 in English, 
228 in Spanish) and there are a large number of in- 
stances that are disambiguated by just a handful of 
systems. In the English lexical sample, there are 
1,277 test instances that are correctly disambiguated 
by five or fewer of the 23 systems. This is nearly 
30% of the test data, and confirms that this was a 
very challenging set of test instances. 

There is also a very clear correlation between the 
number of training examples available for a particu- 
lar sense of a word and the number of systems that 
are able to disambiguate instances of that word cor- 
rectly. For example, Table 6 shows that there were 
174 English verb instances that no system disam- 
biguated correctly. On average there were only 6 
training examples for the correct senses of these in- 
stances. However, there were 28 instances that all 
23 English systems were able to disambiguate. For 
these instances an average of 47 training examples 
were available for each correct sense. 

This correlation between instance difficulty and 
number of training examples may suggest that future 
Sense val exercises provide a minimum number of 
training examples for each sense, or adjust the scor- 
ing to reflect the difficulty of disambiguating a sense 
with very few training examples. 

Finally, we assess the difficulty associated with 
word types by calculating the average number of 
systems that were able to disambiguate the instances 
associated with that type. This information is pro- 
vided for the English and Spanish lexical samples in 
Tables \j\ and || Each word is shown with its part of 
speech, the number of test instances, and the average 



number of systems that were able to disambiguate 
each of the test instances. 

The verb collaborate is the easiest according to 
this metric in the English lexical sample. It has 30 
test instances that were disambiguated correctly by 
an average of 20.2 of the 23 systems. The verb find 
proves to be the most difficult, with 68 test instances 
disambiguated correctly by an average of 4.2 sys- 
tems. A somewhat less extreme range of values is 
observed for the Spanish lexical sample in Table ||. 
The adjective claw had 66 test instances that were 
disambiguated correctly by an average of 7.6 of the 
8 systems. The most difficult word was the verb con- 
ducir, which has 54 test instances that were disam- 
biguated correctly by an average of 3.8 systems. 

5 Conclusion 

This paper presents an analysis of the results from 
the English and Spanish lexical sample tasks of 
SENSEVAL-2. The analysis is based on the kappa 
statistic and a measure known as optimal combina- 
tion. It also assesses the difficulty of the test in- 
stances in these lexical samples. We find that there 
are a significant number of test instances that were 
not disambiguated correctly by any system, and that 
there is some correlation between instance difficulty 
and the number of available training examples. 
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